The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator.
Two families of ensemble methods are usually distinguished:
Averaging methods: Here the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any of the single base estimators because its variance is reduced. Example: random forests
Boosting methods: Here base estimators are built sequentially, and each one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble. Example: AdaBoost
In random forests, each tree in the ensemble is built from a sample drawn with replacement from the training set. In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features. As a result of this randomness, the bias of the forest usually slightly increases (with respect to the bias of a single non-random tree) but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model.
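As a rough, self-contained sketch of both ideas (it uses a synthetic dataset from make_classification, not the review data analysed below, and the exact numbers are only illustrative), a single decision tree can be compared against one estimator from each ensemble family:
In [ ]:
# Illustrative sketch only: a single decision tree (high variance) versus an
# averaging ensemble (random forest) and a boosting ensemble (AdaBoost).
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
models = {"single tree": DecisionTreeClassifier(random_state=0),
          "random forest (averaging)": RandomForestClassifier(n_estimators=100, random_state=0),
          "AdaBoost (boosting)": AdaBoostClassifier(n_estimators=100, random_state=0)}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())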
Scikit-learn is a high-level Python module built on NumPy, SciPy, and matplotlib. It integrates a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. The package focuses on bringing machine learning to non-specialists using a general-purpose high-level language, with emphasis on ease of use, performance, documentation, and API consistency.
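The API consistency mentioned above means every estimator exposes the same fit / predict / score interface. A minimal sketch (toy iris data, purely for illustration):
In [ ]:
# Minimal sketch of the shared estimator API: either classifier can be swapped
# in without changing the surrounding code (toy data, illustration only).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
for model in (LogisticRegression(max_iter=1000), RandomForestClassifier(n_estimators=50)):
    model.fit(X, y)                                  # same method names for every estimator
    print(type(model).__name__, model.score(X, y))   # training accuracy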
In [ ]:
import pandas as pd
import numpy as np
TRAIN_CSV = r"C:\Users\kmpoo\Dropbox\HEC\Teaching\Python for PhD May 2019\python4phd\Session 3\Sent\sentence_review.csv"   # raw string so the backslashes are not treated as escape characters
# on pandas >= 2.0, replace error_bad_lines=False with on_bad_lines="skip"
dataframe = pd.read_csv(TRAIN_CSV, sep=",", error_bad_lines=False, header=0, low_memory=False, encoding="Latin1")
print(dataframe)
In [ ]:
dataframe = dataframe.assign(nWords=lambda x: x['review_text'].str.split().str.len())   # word count per review
dataframe['bi_senti'] = ["positive" if x >= 4 else "negative" for x in dataframe['sentiment']]   # binarize sentiment: scores of 4 or more are labelled positive
print(dataframe)
print(dataframe['bi_senti'].value_counts())
In [ ]:
from sklearn.utils import shuffle #To shuffle the dataframe
from sklearn.model_selection import train_test_split
dataframe = shuffle(dataframe)
df_train, df_test = train_test_split(dataframe, test_size=0.2)
print("size of training data ", len(df_train))
Text data requires special preparation before it can be used for predictive modeling.
The text must first be split into individual words (tokens), a step called tokenization. The tokens then need to be encoded as integers or floating-point values for use as input to a machine learning algorithm, a step called feature extraction (or vectorization).
tf-idf (term frequency–inverse document frequency) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval, text mining, and user modeling. The tf–idf value increases proportionally with the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which adjusts for the fact that some words are more frequent in general.
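To make this concrete, here is a small sketch on a made-up toy corpus (not the review data): a word that appears in every document, such as "movie" below, receives the lowest idf and therefore a relatively low tf-idf weight, while rarer words are weighted more heavily.
In [ ]:
# Toy tf-idf example (illustration only, not part of the assignment data).
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

toy_corpus = ["a great movie", "a terrible movie", "a boring movie with great acting"]
toy_vec = TfidfVectorizer()
toy_matrix = toy_vec.fit_transform(toy_corpus)
# get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names()
print(pd.DataFrame(toy_matrix.toarray(), columns=toy_vec.get_feature_names_out()))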
In [ ]:
from sklearn.feature_extraction.text import TfidfVectorizer
def tfidf_vectorizer(corpus):
    """Fit a TfidfVectorizer on the corpus and return the fitted vectorizer."""
    tokenizer = TfidfVectorizer(strip_accents="unicode", analyzer="word", stop_words="english", ngram_range=(1, 2), max_features=20000)
    tokenizer.fit(corpus)
    return tokenizer
vectorizer = tfidf_vectorizer(dataframe['review_text'])   # note: fitted on the full corpus (train and test reviews)
train_x = vectorizer.transform(df_train['review_text'])   # sparse tf-idf matrix for the training reviews
test_x = vectorizer.transform(df_test['review_text'])     # sparse tf-idf matrix for the test reviews
print(train_x)
In [ ]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=10)
classifier = rfc.fit(train_x, df_train['bi_senti'])
acc = classifier.score(test_x, df_test['bi_senti'])
print("accuracy of rfc is = ", acc)
rfc.predict_proba(test_x)[0:10]   # class probabilities for the first 10 test reviews
In [ ]:
# Score a new, unseen review. transform() expects an iterable of documents, so the review is passed as a one-element Series.
s = pd.Series(["This movie was absolutely horrible. A boring, random, nonsensical mess from start to finish. The film is incompetently directed from a very poor script. It feels more like a superhero movie from the early 2000's such as Catwoman or Daredevil. Watching it makes it clear that the people involved had no idea what they were doing, and should never have been put in charge of a project this size to begin with. The story makes no sense, and the whole reason Batman wants to kill Superman is contrived. Batman and Superman hate each other because they both cause collateral damage and human death, and neither one ever sees fit to point out their similarities, or try and talk to each other about their different perspectives. Apparently that would have been too interesting, so of course Snyder didn't include it."])
x = vectorizer.transform(s)
print(rfc.classes_)          # order of the probability columns below
print(rfc.predict_proba(x))